Abstract

We propose a new way of incorporating temporal information present in videos into Spatial
Convolutional Neural Networks (ConvNets) trained on images that avoids training Spatio-Temporal
ConvNets from scratch. We describe several initializations of weights in the 3D Convolutional
Layers of a Spatio-Temporal ConvNet using 2D Convolutional Weights learned from ImageNet. We show
that it is important to initialize 3D Convolutional Weights judiciously in order to learn temporal
representations of videos. We evaluate our methods on the UCF-101 dataset and demonstrate
improvement over Spatial ConvNets.


1 Introduction 

Recognizing the action performed in videos is one of the most challenging problems in computer 
vision. Compared to images that only contain spatial information about objects present in one shot 
(frame), videos represent a continuous flow of information that describes the physics of the world 
that we live in. In this setting, learning the underlying temporal representations of the
information present in videos becomes an important challenge for recognizing the action.

Despite the fact that GPUs are getting faster and gaining more memory every year, training
Spatio-Temporal ConvNets from scratch on large-scale video datasets such as Sports-1M [7] and
Facebook-380K [15] is very time consuming and requires nearly a month to complete. At the same
time, training Spatio-Temporal ConvNets on smaller datasets such as UCF-101 [13] and HMDB [9]
leads to severe overfitting. Taking a ConvNet trained on ImageNet [11] and fine-tuning it on
individual video frames solves the problem of overfitting, but the solution is unsatisfying
because this model doesn't learn temporal representations present in multiple frames.

To tackle the above challenge, we propose several ways of initializing 3D convolutional weights,
which learn temporal representations of videos, using 2D convolutional weights learned from
images. This dramatically speeds up training of Spatio-Temporal ConvNets and reduces overfitting
on relatively small video datasets. We found that in order to learn temporal representations
present in videos and get improvements in accuracy, it is important to initialize weights in 3D
convolutional layers judiciously. Otherwise, the Spatio-Temporal ConvNet remains stuck at a
plateau where it only extracts spatial information from videos, and accuracy does not improve.
By appropriately initializing weights and using a Composite LSTM that learned representations of
video sequences [14], trained on labelled examples from UCF-101 and unlabelled examples from
Sports-1M, we managed to nearly match the current best classification accuracy [17] on RGB data
extracted from UCF-101, which uses many additional tricks to improve performance.


2 Related Work 

Research in action recognition was mainly driven by advancements in object recognition in images,
where those approaches were extended or adapted to deal with videos. Traditional shallow
approaches consisted of three main stages. First, sparse spatio-temporal interest points were
detected in videos and described using Histogram of Oriented Gradients (HOG) [1] and Histogram
of Optical Flow (HOF) [2]. More recently, Wang et al. [16] proposed dense trajectories, which
became the state-of-the-art hand-crafted features for action recognition. Next, those features
were combined into a fixed-size vector description, such as Bag of Words (BoW) or Fisher Vectors
(FV). Lastly, a standard classifier such as an SVM was trained on the BoW or FV representation to
distinguish among the classes of interest. This approach still has state-of-the-art performance
on the UCF-101 and HMDB datasets.

With the wider availability of labeled image data and advancements in parallel computing, Deep
ConvNets [8] overshadowed traditional approaches in extracting features from images. Motivated
by those results, researchers [7] and [15] created large-scale labelled video datasets and
trained on them Deep Spatio-Temporal ConvNets, which were originally proposed by Ji et al. [6].
Even though their models had access to temporal information present in videos, they didn't
perform better than a simple Spatial ConvNet fine-tuned on image frames extracted from the
UCF-101 and HMDB datasets [7], [15]. Recently, Simonyan et al. [12] showed that a Deep Early
Fusion Spatio-Temporal ConvNet trained on dense optical flow extracted from videos significantly
boosts the performance of a fine-tuned spatial ConvNet trained on images. Their approach nearly
matched state-of-the-art performance on the UCF-101 dataset. Additionally, [17], [4], [5]
recently showed that spatio-temporal pooling of Deep ConvNet features gives additional
improvement in performance.


3 Transforming Features 

In this section we describe several ways of transforming 2D Convolutional Weights into 3D ones
without losing the spatial information learned by training a Spatial ConvNet on ImageNet. Suppose
that we have a 2D Convolutional Weight Matrix $W^{(2D)}$ derived from training a Spatial ConvNet
on ImageNet. Our goal is to create a 3D Convolutional Weight Matrix $W^{(3D)}$ with temporal
dimension $T$ using those weights. Each sub-matrix $W_t^{(3D)}$, $\forall t \in \{1, \dots, T\}$,
has the same dimension as the original 2D Convolutional Weight Matrix.

Figure 1: Example values of $a_t$ for each initialization: (a) IA, (b) IS, (c) ZWI, (d) NWI.

In a trained Spatial ConvNet, each layer expects the input from the layer below it to be within
some specific range. It is important to initialize $W^{(3D)}$ so that the output of the 3D
Convolutional Layer remains approximately in the same range as the output of the originally
learned 2D Convolutional Layer. To enforce this constraint, the sum of all sub-matrices of
$W^{(3D)}$ has to be equal to $W^{(2D)}$ at initialization time, i.e.
$\sum_{t=1}^{T} W_t^{(3D)} = W^{(2D)}$.
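One way to see why the sum constraint keeps activations in range is to feed a clip of $T$ identical frames through both layers: the 3D response then equals the 2D response. The following is a minimal NumPy sketch of that check, not the authors' implementation; the single-channel kernel, the helper names `correlate2d_valid` and `correlate3d_one_step`, and the example coefficients are assumptions made for illustration.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation of one channel with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def correlate3d_one_step(frames, kernel_3d):
    """Response of a 3D kernel at one temporal position (exactly T frames)."""
    return sum(correlate2d_valid(f, w) for f, w in zip(frames, kernel_3d))

rng = np.random.default_rng(0)
w_2d = rng.normal(size=(3, 3))        # stand-in for a learned 2D kernel
T = 3
a = np.array([0.25, 0.5, 0.25])       # hypothetical coefficients that sum to 1
w_3d = a[:, None, None] * w_2d        # shape (T, 3, 3); slices sum to w_2d

frame = rng.normal(size=(8, 8))
static_clip = np.stack([frame] * T)   # T identical frames

out_2d = correlate2d_valid(frame, w_2d)
out_3d = correlate3d_one_step(static_clip, w_3d)
print(np.allclose(out_2d, out_3d))
```

For real video the frames differ, so the two outputs are only approximately equal, which matches the "approximately in the same range" wording above.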

Below we describe the four kinds of initializations of $W^{(3D)}$ that we considered.

3.1 Initialization by Averaging. (IA) 

Since consecutive image frames in videos share similarities in appearance, we would expect
$W^{(2D)}$ to extract very similar spatial representations from each consecutive image frame.
Therefore, we can initialize each sub-matrix $W_t^{(3D)}$ to extract the same spatial
representations present in its respective input feature map. This can be accomplished by dividing
each element of the 2D Convolutional weight matrix by the temporal size of the convolutional
layer and setting all convolutional weights across the temporal dimension of $W^{(3D)}$ to the
resulting matrix. More formally, it can be expressed as:

$$W_t^{(3D)} = \frac{W^{(2D)}}{T}, \quad \forall t \in \{1, \dots, T\}$$
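The averaging rule can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the code used in the experiments; the function name `init_by_averaging` and the choice to put the temporal axis first are assumptions.

```python
import numpy as np

def init_by_averaging(w_2d, T):
    """Return a (T, ...) stack in which every temporal slice equals w_2d / T."""
    return np.repeat(w_2d[None, ...] / T, T, axis=0)

# Toy 2D weights with assumed layout (out_channels, kernel_h, kernel_w).
w_2d = np.arange(12, dtype=np.float64).reshape(3, 2, 2)
w_3d = init_by_averaging(w_2d, T=3)
print(w_3d.shape)
```

Summing `w_3d` over its first axis recovers `w_2d`, so the constraint from the start of this section holds by construction.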

3.2 Initialization by Scaling. (IS) 

This method can be viewed as a generalization of initialization by averaging. Instead of
dividing $W^{(2D)}$ by the same number $T$ and hence making all $W_t^{(3D)}$ equal to each
other, we can induce some diversity by setting each $W_t^{(3D)}$ to $W^{(2D)}$ divided by some
random constant. Any combination of the values of those constants could work, as long as the
constraint described in Section 3 is satisfied. Therefore, the initialization becomes:

$$W_t^{(3D)} = a_t \, W^{(2D)}, \quad \text{where } a_t > 0 \text{ and } \sum_{t=1}^{T} a_t = 1$$
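A possible NumPy sketch of the scaling rule follows. The text does not say how the random constants were drawn, so `random_coefficients` (uniform draws, then normalization) is only one assumed way to obtain positive values that sum to 1; both function names are hypothetical.

```python
import numpy as np

def init_by_scaling(w_2d, a):
    """Scale w_2d by each coefficient a_t; a must be positive and sum to 1."""
    a = np.asarray(a, dtype=np.float64)
    if np.any(a <= 0) or not np.isclose(a.sum(), 1.0):
        raise ValueError("coefficients must be positive and sum to 1")
    # Reshape a to (T, 1, ..., 1) so it broadcasts over the 2D weights.
    return a.reshape((-1,) + (1,) * w_2d.ndim) * w_2d[None, ...]

def random_coefficients(T, rng):
    """Assumed sampling scheme: positive uniform draws normalized to sum to 1."""
    raw = rng.uniform(0.1, 1.0, size=T)
    return raw / raw.sum()

rng = np.random.default_rng(0)
w_2d = rng.normal(size=(4, 3, 3))
a = random_coefficients(3, rng)
w_3d = init_by_scaling(w_2d, a)
```

Passing equal coefficients `[1/T] * T` reduces this to initialization by averaging.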

3.3 Zero Weight Initialization. (ZWI) 

It is natural to ask what would happen if we initialize one of the $W_t^{(3D)}$ to $W^{(2D)}$
and initialize the other sub-matrices to the zero matrix. Would the sub-matrices initialized with
zeros be able to learn to extract meaningful representations from the input in spite of having
limited training data? Additionally, despite some differences in values, the distribution of
weights is the same in each sub-matrix of $W^{(3D)}$ in the above initializations. Because of
that, the network might get stuck at the plateau that was reached by learning spatial
representations on images, and as a result it might not learn the temporal representation present
in multiple frames. Motivated by these points, we explored the following initialization
technique, which can be expressed as:

$$W_t^{(3D)} = \begin{cases} W^{(2D)}, & \text{if } t = 1 \\ 0, & \text{otherwise.} \end{cases}$$

Note that, for any $1 \le t \le T$, $W_t^{(3D)}$ could be initialized to $W^{(2D)}$, as long as
all values of the other sub-matrices are zero. In our experiments we found that changing the
order made no difference in the resulting accuracy.
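The zero weight rule, including the freedom to choose which temporal slice carries the 2D weights, might be sketched as below. The function name and the 0-based `keep` argument are assumptions; `keep=0` corresponds to $t = 1$ in the formula.

```python
import numpy as np

def zero_weight_init(w_2d, T, keep=0):
    """Copy w_2d into one temporal slice (0-based index keep); zero the rest."""
    w_3d = np.zeros((T,) + w_2d.shape, dtype=w_2d.dtype)
    w_3d[keep] = w_2d
    return w_3d

w_2d = np.ones((2, 3, 3))
first = zero_weight_init(w_2d, T=3)           # weights in the first slice
last = zero_weight_init(w_2d, T=3, keep=2)    # weights in the last slice
```

Either placement satisfies the sum constraint, since the only nonzero slice is `w_2d` itself.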


3.4 Negative Weight Initialization. (NWI) 

We can extend the zero weight initialization by encouraging each sub-matrix $W_t^{(3D)}$,
excluding the first one, to move even further from the original values of $W^{(2D)}$. This can be
accomplished by setting the values of the sub-matrices $W_t^{(3D)}$, $2 \le t \le T$, to
negatively signed values of $W^{(2D)}$ divided by some constant. This makes the absolute values
of $W_t^{(3D)}$, $t = 1$, larger compared to the other initializations. As in ZWI, changing the
order of sub-matrices doesn't change the resulting accuracy. It can be expressed as follows:

$$W_t^{(3D)} = a_t \, W^{(2D)}, \quad \text{where } a_t = \begin{cases} \frac{2T-1}{T}, & \text{if } t = 1 \\ -\frac{1}{T}, & \text{otherwise.} \end{cases}$$
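These coefficients still satisfy the sum constraint, because $\frac{2T-1}{T} - (T-1)\frac{1}{T} = 1$. A short NumPy sketch of the rule, with hypothetical function names and the temporal axis assumed to come first, makes this easy to check numerically:

```python
import numpy as np

def nwi_coefficients(T):
    """a_1 = (2T - 1) / T and a_t = -1 / T for every other t."""
    a = np.full(T, -1.0 / T)
    a[0] = (2.0 * T - 1.0) / T
    return a

def negative_weight_init(w_2d, T):
    """Build the (T, ...) weight stack a_t * w_2d."""
    a = nwi_coefficients(T)
    return a.reshape((-1,) + (1,) * w_2d.ndim) * w_2d[None, ...]

rng = np.random.default_rng(0)
w_2d = rng.normal(size=(4, 3, 3))
w_3d = negative_weight_init(w_2d, T=3)
coefficient_sums = [nwi_coefficients(T).sum() for T in (2, 3, 5)]
```

For $T = 3$ the coefficients are $5/3, -1/3, -1/3$, so the first slice is scaled up while the remaining slices start with the opposite sign.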
3.5 Architecture Details.


The ConvNet with the architecture described at
https://github.com/TorontoDeepLearning/convnet/blob/master/examples/imagenet/CLS_net_20140801232522.pbtxt
was trained on ImageNet ILSVRC-2012. It yielded 13.5% top-5 error on the ILSVRC-2012 validation
set. In this study, we did not focus on using the best ImageNet model. Rather, we chose one that
was convenient and easy to work with and focused on studying the relative impact of our proposed
method. When fine-tuning, we removed the softmax layer and reduced the number of units in the
last fully connected hidden layer from 4096 to 2048. Otherwise, the performance of the model
dropped by 6-7%. We also used an aggressive dropout of 0.8 in both FC layers. The first two
convolutional layers were transformed into 3D Convolutional layers with temporal size of 3 and
stride of 1, and temporal size of 2 and stride of 1, respectively. The 3D pooling operation
between those layers was performed on regions with temporal size of 2. The optimization of all
convolutional layers started after 500 iterations in order to preserve the good features learned
by the model on ImageNet.
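The delayed start of convolutional-layer training can be thought of as a per-layer learning-rate schedule. The sketch below is one assumed reading of that sentence, not the authors' training code: `conv_lr`, `base_lr` and the 0-based iteration count are hypothetical, and only the 500-iteration threshold comes from the text.

```python
def conv_lr(iteration, base_lr, freeze_iterations=500):
    """Learning rate for convolutional layers: zero while frozen, base_lr after.

    iteration is assumed to be 0-based, so updates begin at iteration 500.
    """
    return 0.0 if iteration < freeze_iterations else base_lr

# Example: the fully connected layers would train from iteration 0, while the
# convolutional layers use this schedule.
schedule = [conv_lr(i, base_lr=0.01) for i in (0, 499, 500, 1000)]
```

Freezing the transferred layers at first lets the newly sized fully connected layer adapt before gradients start to modify the ImageNet features.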

Due to the limited size of the training set, other 3D Convolutional Networks with a larger number
of 3D Convolutional Layers performed worse than this model.


4 Evaluation and Discussion 

4.1 Dataset 

The evaluation is performed on the first split of the UCF-101 dataset. UCF-101 contains 13.2K
videos (25 frames/second) annotated into 101 classes, where each split contains 9.5K training
videos.


4.2 Effect of Different Initializations. 

The results of all transfer learning experiments are shown in Table 1 and lead to several
conclusions.

First, fine-tuning a 3D ConvNet initialized by averaging or scaling yielded only a marginal
improvement compared to fine-tuning a 2D ConvNet. For the initialization by scaling, the best
results were obtained with $a_1 = a_3 = \frac{1}{4}$ and $a_2 = \frac{1}{2}$ for the first
convolutional layer, and $a_1 = a_2 = \frac{1}{2}$ for the second convolutional layer. We think
that since each sub-matrix across the temporal dimension is initialized with similar weights that
extract spatial features, the convolutional layer couldn't move beyond this plateau.

In contrast, despite the fact that all sub-matrices of $W^{(3D)}$ except one were initialized
with the zero matrix, the model did surprisingly better than models initialized using averaging
or scaling. This could be explained by the fact that, at initialization time, the first
sub-matrix, which extracts spatial representations, encouraged the other sub-matrices to learn to
extract temporal representations. Also, compared to IA and IS, larger differences in values
between the first and the other sub-matrices may have helped the model learn to extract better
temporal representations from the input. This might explain why the model initialized by Negative
Weight Initialization yielded the best performance of all.

Finally, by averaging our softmax class probabilities with the softmax class probabilities of the
Composite LSTM, we managed to nearly match the current highest accuracy of deep learning based
approaches on RGB data. We think that our performance can be further increased by using the
different tricks described in [17] and by using SVM fusion instead of averaging. The results are
summarized in Table 2.
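Averaging the class probabilities of two models is a simple form of late fusion. The sketch below illustrates it with NumPy under the assumption that both models score the same videos and classes in the same order; the function name and the toy probabilities are made up for illustration and are not results from the paper.

```python
import numpy as np

def average_fusion(probs_a, probs_b):
    """Average two (num_videos, num_classes) arrays of class probabilities."""
    probs_a = np.asarray(probs_a, dtype=np.float64)
    probs_b = np.asarray(probs_b, dtype=np.float64)
    if probs_a.shape != probs_b.shape:
        raise ValueError("both models must score the same videos and classes")
    return 0.5 * (probs_a + probs_b)

# Toy scores for 2 videos and 3 classes (made-up numbers).
p_convnet = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_lstm = np.array([[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]])

fused = average_fusion(p_convnet, p_lstm)
predictions = fused.argmax(axis=1)   # predicted class index per video
```

In this toy case the fused prediction for each video follows the more confident model, which is the usual motivation for combining models whose errors differ.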


Model                          Accuracy
---------------------------    --------
Spatial ConvNet                71.8 %
Init. by Averaging (IA)        72.0 %
Init. by Scaling (IS)          72.4 %
Zero Weight Init. (ZWI)        73.3 %
Negative Weight Init. (NWI)    73.9 %

Table 1: Comparison of different initialization techniques in the Spatio-Temporal ConvNet with a
simple Spatial ConvNet.


Model                          Accuracy
---------------------------    --------
Spatial ConvNet [12]           72.8 %
C3D [15]                       72.3 %
C3D + fc6 [15]                 76.4 %
LRCN [3]                       71.1 %
Composite LSTM [14]            75.8 %
NWI + Composite LSTM           78.3 %
ConvNet Features [17]          79.0 %

Table 2: Comparison with other state-of-the-art neural network based approaches on RGB data.


4.3 Two-Stream Spatio-Temporal ConvNets 

We further ran experiments and evaluated the complete two-stream model, which combines RGB and
Optical Flow based Spatio-Temporal ConvNets, on the first split. The Optical Flow based ConvNet,
similar to the one proposed by [12], was trained on single-frame optical flow and on stacks of 10
optical flows. This gave accuracies of 72.2% and 77.5% respectively. Additionally, we trained a
slow fusion version of this model, with the same slow fusion setup as in [7], and this model gave
an accuracy of 79.3%. We then combined the softmax scores of this optical flow based slow fusion
model with the softmax scores of the RGB based NWI + Composite LSTM model and obtained an
accuracy of 85.3% on UCF-101. Also, despite the lower accuracy of the early fusion model compared
to its slow fusion version, the combination of the optical flow based early fusion model with
NWI + Composite LSTM gave the same accuracy of 85.3%. Essentially, despite its better standalone
results, the slow fusion model gives no additional improvement in the final averaged scores. The
comparison of our model with state-of-the-art action recognition models is summarized in Table 3.

Figure 2: 3D Convolutional Kernels $W^{(3D)} = (W_1^{(3D)}, W_2^{(3D)}, W_3^{(3D)})$ learned by
the first 3D Convolutional Layer initialized by (a) Initialization by Averaging, (b)
Initialization by Scaling, (c) Zero Weight Initialization and (d) Negative Weight Initialization.
In the Spatio-Temporal ConvNet initialized by averaging or scaling, $W_2^{(3D)}$ and $W_3^{(3D)}$
look very similar to $W_1^{(3D)}$, whereas in the ConvNet initialized by NWI, $W_2^{(3D)}$ and
$W_3^{(3D)}$ look different from $W_1^{(3D)}$. Also, despite limited labelled training video
data, $W_2^{(3D)}$ and $W_3^{(3D)}$ look like edge and color blob detectors in the
Spatio-Temporal ConvNet initialized by ZWI.

Method                                                    Accuracy
------------------------------------------------------    --------
LRCN [3]                                                  82.9 %
Two-Stream Convolutional Net [12] (split 1)               87.0 %
C3D + fc6 + iDT [15]                                      86.7 %
ConvNet Features + iDT [17]                               89.7 %
Multi-skip feature stacking [10]                          89.1 %
Composite LSTM Model [14] (split 1)                       84.3 %
Two-Stream Spatio-Temporal Convolutional Net (split 1)    85.3 %

Table 3: Comparison of different action recognition models.

5 Conclusions 

We proposed a new way of creating Spatio-Temporal ConvNets by incorporating temporal information
into a Spatial ConvNet trained on images. Despite having limited labelled video training data,
our model outperformed Spatio-Temporal ConvNets trained on large-scale labelled video datasets.
It would be interesting to see whether initializing a Spatio-Temporal ConvNet with one of the
initialization techniques proposed here improves performance on large-scale labelled video
datasets. However, it is very challenging for us to process datasets of such scale.


References 

[1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages
886-893, 2005.

[2] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and
appearance. In ECCV, pages 428-441, 2006.

[3] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and
T. Darrell. Long-term recurrent convolutional networks for visual recognition and description.
CoRR, abs/1411.4389, 2014.

[4] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional
activation features. In ECCV, pages 392-407, 2014.

[5] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for
visual recognition. In ECCV, 2014.

[6] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):221-231, Jan 2013.

[7] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video
classification with convolutional neural networks. In CVPR, 2014.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural
networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on
Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012,
Lake Tahoe, Nevada, United States, pages 1106-1114, 2012.

[9] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human
motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.

[10] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond gaussian pyramid: Multi-skip feature 
stacking for action recognition. CoRR, abs/1411.6660, 2014. 

[11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, 
M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014. 

[12] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. 
In Advances in Neural Information Processing Systems, 2014. 

[13] K. Soomro, A. Roshan Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from 
videos in the wild. In CRCV-TR-12-01, 2012. 

[14] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations 
using lstms. CoRR, abs/1502.04681, 2015. 

[15] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3D: generic features for video 
analysis. CoRR, abs/1412.0767, 2014. 

[16] H. Wang and C. Schmid. Action recognition with improved trajectories. In IEEE International
Conference on Computer Vision, Sydney, Australia, 2013.

[17] S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov. Exploiting image-trained cnn
architectures for unconstrained video classification. CoRR, abs/1503.04144, 2015.
